Data Science 101 (Getting started in NLP): Tokenization tutorial
One common task in NLP (Natural Language Processing) is tokenization. "Tokens" are usually individual words (at least in languages like English), and "tokenization" is taking a text or set of texts and breaking it up into its individual words. These tokens are then used as the input for other types of analysis or tasks, like parsing (automatically tagging the syntactic relationships between words).

In this tutorial you'll learn how to tokenize a text. We'll be using a corpus of transcribed speech from bilingual children speaking in English. You can find more information on this dataset and download it here.
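
To make the idea of tokenization concrete, here is a minimal sketch in Python (used here only for illustration). It assumes that a token is a run of letters, optionally with an internal apostrophe, and that case can be ignored. The `tokenize` function and the sample sentence are hypothetical and are not taken from the corpus described above. Real tokenizers handle many more cases (numbers, hyphens, languages that don't use spaces between words).

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens in order (illustrative only)."""
    # Assumption: a token is a run of letters, optionally joined by apostrophes
    # (so "dog's" stays one token); punctuation and whitespace are discarded.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

# Hypothetical sample sentence, not from the tutorial's dataset
sample = "The dog's ball rolled away, and the dog chased it."
tokens = tokenize(sample)
print(tokens)
```

The resulting list of tokens is the kind of input that later steps, such as counting word frequencies or parsing, would work from.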